from lec_utils import *
from lec03_utils import *
from PIL import Image
Announcements 📣¶
- Homework 1 is due on Thursday.
Post on Ed or come to Office Hours for help!
- The Welcome Survey was due yesterday; please fill it out if you haven't already!
EECS 370 has the same midterm time as us. If you're in 370, sign up to take their alternate midterm exam the following day. - We released a 🎥 walkthrough video of the "area codes" dictionary example from Lecture 2.
- We also posted the solutions to the exercises from Lecture 2 and Discussion 1 in our public GitHub repository. Starting today, I'll post "blank" versions of lecture notebooks before class, and "filled" versions of notebooks with the code that I write after class, so that you can code along with me.
- Check out the new Resources tab on the course website, with links to lots of supplementary resources and past exams from similar classes!
Agenda¶
- Recap:
for-loops and dictionaries. numpyarrays.- Multidimensional arrays and linear algebra.
- Randomness and simulation.
Aside: Python Tutor¶
Python Tutor, linked on the Resources tab of the course website, visualizes the execution of Python code.
# The mystery example from Lecture 2.
from IPython.display import IFrame
IFrame(width="800",
height="500",
frameborder="0",
src="https://pythontutor.com/iframe-embed.html#code=def%20mystery%28vals%29%3A%0A%20%20%20%20vals%5B-1%5D%20%3D%2015%0A%20%20%20%20return%20vals.append%28'BBB'%29%0A%20%20%20%20%0Acreature%20%3D%20%5B1,%202,%203%5D%0A%0Amystery%28creature%29%0Amystery%28creature%29%0Amystery%28creature%29&codeDivHeight=400&codeDivWidth=350&cumulative=false&curInstr=14&heapPrimitives=nevernest&origin=opt-frontend.js&py=311&rawInputLstJSON=%5B%5D&textReferences=false")
Question 🤔 (Answer at practicaldsc.org/q)
How much progress have you made on Homework 1?
No judgement!
- A. I've submitted it!
- B. I've finished more than half of it.
- C. I've finished a few questions.
- D. I've looked at it, but haven't written any code yet.
- E. Haven't started at all.
Recap: for-loops and dictionaries¶
(source)
for-loops in Python¶
- In Python, you can loop over any iterable. Strings, lists, and dictionaries are all examples of iterables.
- All of the following are valid ways to write a
for-loop.
for value in "this is a string":
for element in lst: # Assume lst is a list.
for i in range(len(lst)):
- One of the more common
for-loop examples you may have seen in earlier classes involved performing some operation to every element of a sequence, e.g. doubling the numbers in a list.
def double(vals):
new_vals = []
for val in vals:
new_vals.append(vals * 2)
return new_vals
- We are going to avoid ❌ these kinds of
for-loops in this class, because there are much faster ways of achieving the same goal innumpyandpandas. We'll see these soon.
while-loops will come up sparingly.
But conceptually, you should know how they work!
List comprehension¶
In the situations when we do want to perform some operation to every element in a list, a common pattern is the list comprehension.
vals = [2, -1, 9, 4, 3, 8]
...
Ellipsis
...
Ellipsis
...
Ellipsis
Dictionaries¶
- A dictionary stores a collection of key-value pairs.
They are the equivalent of a map in C++.
{curly brackets}denote the start and end of a dictionary, a colon:is used to denote a single key value pair, and a comma,is used to separate key-value pairs.
dog = ...
dog
Ellipsis
- We retrieve a value in a dictionary using its key.
...
Ellipsis
...
Ellipsis
- After creation, we can add or change key-value pairs.
...
Ellipsis
dog
Ellipsis
- A dictionary's keys must be immutable (numbers, strings, Booleans), while its values can be anything.
# Here, we're trying to add a value with a key of [1, 2].
# Since [1, 2] is mutable, it can't be used as a key.
...
Ellipsis
Activity
Complete the implementation of the function find_anagrams, which takes in words, a list of strings, and returns a dictionary describing the anagrams present in words. Example behavior is given below.
>>> find_anagrams(['dog', 'hello', 'enlist', 'silent', 'a gentleman', 'god', 'elegant man', 'listen'])
{'dgo': ['dog', 'god'],
'ehllo': ['hello'],
'eilnst': ['enlist', 'silent', 'listen'],
' aaeeglmnnt': ['a gentleman', 'elegant man']}
example_words = ['dog', 'hello', 'enlist', 'silent', 'a gentleman', 'god', 'elegant man', 'listen']
def find_anagrams(words):
...
numpy arrays¶
Import statements¶
- We use
importstatements to add the objects (values, functions, classes) defined in other modules to our programs. There are a few different ways toimport.
Other terms I'll use for "module" are "library" and "package".
- Option 1:
import module.
Now, everytime we want to use a name in module, we must write module.<name>.
import math
...
Ellipsis
...
Ellipsis
- Option 2:
import module as m.
Now, everytime we want to use a name in module, we can write m.<name> instead of module.<name>.
# This is the standard way that we will import numpy.
import numpy as np
...
Ellipsis
...
Ellipsis
- Option 3:
from module import ....
This way, we explicitly state the names we want to import from module.
To import everything, write from module import *.
# Importing a particular function from the requests module.
from requests import get
# This typically fills up the namespace with a lot of unnecessary names, so use sparingly.
from math import *
sqrt
<function math.sqrt(x, /)>
NumPy¶
- NumPy (pronounced "num pie") is a Python library (module) that provides support for arrays and operations on them.
- The
pandaslibrary, which we will use for tabular data manipulation, works in conjunction withnumpy.
- To use
numpy, we need to import it. It's usually imported asnp(but doesn't have to be!)
We also had to install it on your computer first, but you already did that when you set up your environment.
import numpy as np
Arrays¶
- The core data structure in
numpyis the array. Moving forward, "array" will always refer to anumpyarray.
- One way to instantiate an array is to pass a list as an argument to the function
np.array.
...
Ellipsis
- Arrays, unlike lists, must be homogenous – all elements must be of the same type.
# All elements are converted to strings!
...
Array-number arithmetic¶
- Arrays make it easy to perform the same operation to every element without a
for-loop. This behavior is formally known as "broadcasting", but we often say these operations are vectorized.

temps = [68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62]
temps
[68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62]
temp_array = np.array(temps)
# Increase all temperatures by 3 degrees.
...
# Halve all temperatures.
...
# Convert all temperatures to Celsius.
...
- Note: In none of the above cells did we actually modify
temp_array! Each of those expressions created a new array. To actually changetemp_array, we need to reassign it to a new array.
temp_array
array([68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62])
temp_array = ...
# Now in Celsius!
temp_array
Ellipsis
⚠️ The dangers of unnecessary for-loops¶
- Under the hood,
numpyis implemented in C and Fortran, which are compiled languages that are much faster than Python. As a result, these vectorized operations are much quicker than if we used a vanilla Pythonfor-loop.
Also, the fact that arrays must be homogenous lend themselves to more efficient representations in memory.
- We can time code in a Jupyter Notebook. Let's try and square a long sequence of integers and see how long it takes with a Python loop:
%%timeit
squares = []
for i in range(1_000_000):
squares.append(i * i)
45.7 ms ± 302 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
- In vanilla Python, this takes about 0.04 seconds per loop. In
numpy:
%%timeit
squares = np.arange(1_000_000) ** 2
1.32 ms ± 33.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
- Only takes about 0.001 seconds per loop, more than 40x faster!
Element-wise arithmetic¶
- We can apply arithmetic operations to multiple arrays, provided they have the same length.
- The result is computed element-wise, which means that the arithmetic operation is applied to one pair of elements from each array at a time.

a = np.array([4, 5, -1])
b = np.array([2, 3, 2])
...
Ellipsis
...
Ellipsis
...
Ellipsis
arr = np.array([3, 8, 4, -3.2])
...
Ellipsis
...
Ellipsis
...
Ellipsis
...
Ellipsis
# An attribute, not a method.
...
Question 🤔 (Answer at practicaldsc.org/q)
What questions do we have about arrays so far?
Activity
🎉 Congrats! 🎉 You won the lottery 💰. Here's how your payout works: on the first day of September, you are paid \$0.01. Every day thereafter, your pay doubles, so on the second day you're paid \$0.02, on the third day you're paid \$0.04, on the fourth day you're paid \$0.08, and so on.
September has 30 days.
Write a one-line expression that uses the numbers 2 and 30, along with the function np.arange and at least one array method, that computes the total amount in dollars you will be paid in September. No for-loops or list comprehensions allowed!
*Note*: We have a 🎥 walkthrough video of this problem, but don't watch it until you've tried it yourself!
...
Ellipsis
Boolean filtering¶
- Comparisons with arrays yield Boolean arrays! These can be used to answer questions about the values in an array.
temp_array
Ellipsis
...
Ellipsis
- How many values are greater than or equal to 65?
...
Ellipsis
- What fraction of values are greater than or equal to 65?
...
Ellipsis
- Which values are greater than or equal to 65?
...
Ellipsis
- Which values are between 65 and 70?
# Note the parentheses!
...
# WRONG!
...
Note: & and | vs. and and or¶
- In Python, the standard symbols for "and" and "or" are, literally,
andandor.
if (5 > 3 and 'h' + 'i' == 'hi') or (-2 > 0):
print('success')
success
- But, when taking the element-wise and/or of two arrays, the standard operators don't work. Instead, use the bitwise operators:
&for "and",|for "or".
temp_array
Ellipsis
# Don't forget parentheses when using multiple conditions!
temp_array[(temp_array % 2 == 0) | (temp_array == temp_array.min())]
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[56], line 2 1 # Don't forget parentheses when using multiple conditions! ----> 2 temp_array[(temp_array % 2 == 0) | (temp_array == temp_array.min())] TypeError: unsupported operand type(s) for %: 'ellipsis' and 'int'
- Read more about this here.
Multidimensional arrays and linear algebra¶
Multidimensional arrays¶
- A matrix can be represented in code using a two dimensional (2D) array.
- 2D arrays also resemble tables, or DataFrames, so it's worthwhile to study how they work.
nums = np.array([
[5, 1, 9, 7],
[9, 8, 2, 3],
[2, 5, 0, 4]
])
nums
array([[5, 1, 9, 7],
[9, 8, 2, 3],
[2, 5, 0, 4]])
# nums has 3 rows and 4 columns.
...
- In addition to creating 2D arrays from scratch, we can also create 2D arrays by reshaping other arrays.
# Here, we're asking to reshape np.arange(1, 7)
# so that it has 2 rows and 3 columns.
a = ...
a
Ellipsis
Operations along axes¶
- In 2D arrays (and DataFrames), axis 0 refers to the rows (up and down) and axis 1 refers to the columns (left and right).

a
Ellipsis
- If we specify
axis=0,a.sumwill "compress" along axis 0.
a.sum(axis=0)
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In[61], line 1 ----> 1 a.sum(axis=0) AttributeError: 'ellipsis' object has no attribute 'sum'
- If we specify
axis=1,a.sumwill "compress" along axis 1.
a.sum(axis=1)
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In[62], line 1 ----> 1 a.sum(axis=1) AttributeError: 'ellipsis' object has no attribute 'sum'
Selecting rows and columns from 2D arrays¶
- You can use
[square brackets]to slice rows and columns out of an array, too.
- The general convention is:
array[<row positions>, <column positions>]
a
Ellipsis
# Accesses row 0 and all columns.
...
# Same as the above.
...
# Accesses all rows and column 1.
...
# Access all rows and columns 0 and 2.
a[:, [0, 2]]
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[67], line 2 1 # Access all rows and columns 0 and 2. ----> 2 a[:, [0, 2]] TypeError: 'ellipsis' object is not subscriptable
# Accesses row 0 and columns 1 and onwards.
...
Activity
Suppose we run the cell below.s = (5, 3)
grid = np.ones(s) * 2 * np.arange(1, 16).reshape(s)
grid[-1, 1:].sum()
What is the output of the cell? Try and answer without writing any code.
...
Ellipsis
Linear algebra review¶
- Arrays are used to perform computation in the context of linear algebra! For example, let's work through Practice Question 8.1 from LARDS, but this time using code.
Consider the vectors $\vec u$ and $\vec v$, defined below:
$$\vec u = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \qquad \vec v = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}$$
show_projection_plot()
We define $X \in \mathbb{R}^{3 \times 2}$ to be the matrix whose first column is $\vec u$ and whose second column is $\vec v$.
u = np.array([1, 0, 0]).reshape(-1, 1)
v = np.array([0, 1, 1]).reshape(-1, 1)
u
array([[1],
[0],
[0]])
v
array([[0],
[1],
[1]])
X = ...
X
Ellipsis
Show that:
$$(X^TX)^{-1}X^T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}$$...
Ellipsis
Find scalars $a$ and $b$ such that $a \vec u + b \vec v$ is the vector in $\text{span}(\vec u, \vec v)$ that is as close to $\vec y = \begin{bmatrix} 4 \\ 2 \\ 8 \end{bmatrix}$ as possible.
...
Ellipsis
Let $\vec e = \vec y - (a \vec u + b \vec v)$, where $a$ and $b$ are the values you found above. You should notice that the sum of the entries in $\vec e$ is 0. Why?
...
Ellipsis
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have about multidimensional arrays and our linear algebra review?
Example: Image processing¶
- It turns out that images can be represented as 3D
numpyarrays.
(image source)- The color of each pixel can be described with three numbers under the RGB model – a red value, green value, and blue value. Each of these can vary from 0 to 1.
img = np.asarray(Image.open('imgs/junior.jpeg')) / 255
img
array([[[0.38, 0.24, 0.18],
[0.35, 0.22, 0.15],
[0.35, 0.22, 0.16],
...,
[0.48, 0.31, 0.23],
[0.49, 0.31, 0.24],
[0.5 , 0.32, 0.25]],
[[0.38, 0.24, 0.17],
[0.35, 0.22, 0.15],
[0.35, 0.22, 0.16],
...,
[0.49, 0.31, 0.24],
[0.49, 0.31, 0.24],
[0.5 , 0.31, 0.24]],
[[0.37, 0.23, 0.16],
[0.35, 0.22, 0.15],
[0.35, 0.22, 0.16],
...,
[0.49, 0.31, 0.24],
[0.49, 0.31, 0.24],
[0.49, 0.31, 0.24]],
...,
[[0.35, 0.2 , 0.04],
[0.33, 0.18, 0.03],
[0.32, 0.16, 0.01],
...,
[0.37, 0.27, 0.13],
[0.39, 0.29, 0.15],
[0.42, 0.32, 0.18]],
[[0.34, 0.19, 0.04],
[0.33, 0.18, 0.04],
[0.33, 0.18, 0.04],
...,
[0.36, 0.26, 0.12],
[0.39, 0.29, 0.15],
[0.44, 0.33, 0.2 ]],
[[0.37, 0.22, 0.07],
[0.36, 0.22, 0.07],
[0.37, 0.22, 0.08],
...,
[0.35, 0.25, 0.11],
[0.38, 0.28, 0.15],
[0.44, 0.34, 0.2 ]]])
img.shape
(2048, 1536, 3)
plt.imshow(img)
plt.axis('off');
Applying a greyscale filter¶
- One way to convert an image to greyscale is to average its red, green, and blue values.
mean_2d = ...
mean_2d
Ellipsis
# This is just a single red channel!
plt.imshow(mean_2d)
plt.axis('off');
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[82], line 2 1 # This is just a single red channel! ----> 2 plt.imshow(mean_2d) 3 plt.axis('off') File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/pyplot.py:3562, in imshow(X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, interpolation_stage, filternorm, filterrad, resample, url, data, **kwargs) 3541 @_copy_docstring_and_deprecators(Axes.imshow) 3542 def imshow( 3543 X: ArrayLike | PIL.Image.Image, (...) 3560 **kwargs, 3561 ) -> AxesImage: -> 3562 __ret = gca().imshow( 3563 X, 3564 cmap=cmap, 3565 norm=norm, 3566 aspect=aspect, 3567 interpolation=interpolation, 3568 alpha=alpha, 3569 vmin=vmin, 3570 vmax=vmax, 3571 origin=origin, 3572 extent=extent, 3573 interpolation_stage=interpolation_stage, 3574 filternorm=filternorm, 3575 filterrad=filterrad, 3576 resample=resample, 3577 url=url, 3578 **({"data": data} if data is not None else {}), 3579 **kwargs, 3580 ) 3581 sci(__ret) 3582 return __ret File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/__init__.py:1473, in _preprocess_data.<locals>.inner(ax, data, *args, **kwargs) 1470 @functools.wraps(func) 1471 def inner(ax, *args, data=None, **kwargs): 1472 if data is None: -> 1473 return func( 1474 ax, 1475 *map(sanitize_sequence, args), 1476 **{k: sanitize_sequence(v) for k, v in kwargs.items()}) 1478 bound = new_sig.bind(ax, *args, **kwargs) 1479 auto_label = (bound.arguments.get(label_namer) 1480 or bound.kwargs.get(label_namer)) File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/axes/_axes.py:5895, in Axes.imshow(self, X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, interpolation_stage, filternorm, filterrad, resample, url, **kwargs) 5892 if aspect is not None: 5893 self.set_aspect(aspect) -> 5895 im.set_data(X) 5896 im.set_alpha(alpha) 5897 if im.get_clip_path() is None: 5898 # image does not already have clipping set, clip to Axes patch File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/image.py:729, in _ImageBase.set_data(self, A) 727 if isinstance(A, PIL.Image.Image): 728 A = pil_to_array(A) # Needed e.g. to apply png palette. --> 729 self._A = self._normalize_image_array(A) 730 self._imcache = None 731 self.stale = True File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/image.py:692, in _ImageBase._normalize_image_array(A) 690 A = cbook.safe_masked_invalid(A, copy=True) 691 if A.dtype != np.uint8 and not np.can_cast(A.dtype, float, "same_kind"): --> 692 raise TypeError(f"Image data of dtype {A.dtype} cannot be " 693 f"converted to float") 694 if A.ndim == 3 and A.shape[-1] == 1: 695 A = A.squeeze(-1) # If just (M, N, 1), assume scalar and apply colormap. TypeError: Image data of dtype object cannot be converted to float
- We need to repeat
mean_2dthree times along axis 2, to use the same values for the red, green, and blue channels.np.repeatwill help us here.
# np.newaxis is an alias for None.
# It helps us introduce an additional axis.
np.arange(5)[:, np.newaxis]
array([[0],
[1],
[2],
[3],
[4]])
np.repeat(np.arange(5)[:, np.newaxis], 3, axis=1)
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
mean_3d = ...
mean_3d
Ellipsis
plt.imshow(mean_3d)
plt.axis('off');
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[86], line 1 ----> 1 plt.imshow(mean_3d) 2 plt.axis('off') File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/pyplot.py:3562, in imshow(X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, interpolation_stage, filternorm, filterrad, resample, url, data, **kwargs) 3541 @_copy_docstring_and_deprecators(Axes.imshow) 3542 def imshow( 3543 X: ArrayLike | PIL.Image.Image, (...) 3560 **kwargs, 3561 ) -> AxesImage: -> 3562 __ret = gca().imshow( 3563 X, 3564 cmap=cmap, 3565 norm=norm, 3566 aspect=aspect, 3567 interpolation=interpolation, 3568 alpha=alpha, 3569 vmin=vmin, 3570 vmax=vmax, 3571 origin=origin, 3572 extent=extent, 3573 interpolation_stage=interpolation_stage, 3574 filternorm=filternorm, 3575 filterrad=filterrad, 3576 resample=resample, 3577 url=url, 3578 **({"data": data} if data is not None else {}), 3579 **kwargs, 3580 ) 3581 sci(__ret) 3582 return __ret File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/__init__.py:1473, in _preprocess_data.<locals>.inner(ax, data, *args, **kwargs) 1470 @functools.wraps(func) 1471 def inner(ax, *args, data=None, **kwargs): 1472 if data is None: -> 1473 return func( 1474 ax, 1475 *map(sanitize_sequence, args), 1476 **{k: sanitize_sequence(v) for k, v in kwargs.items()}) 1478 bound = new_sig.bind(ax, *args, **kwargs) 1479 auto_label = (bound.arguments.get(label_namer) 1480 or bound.kwargs.get(label_namer)) File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/axes/_axes.py:5895, in Axes.imshow(self, X, cmap, norm, aspect, interpolation, alpha, vmin, vmax, origin, extent, interpolation_stage, filternorm, filterrad, resample, url, **kwargs) 5892 if aspect is not None: 5893 self.set_aspect(aspect) -> 5895 im.set_data(X) 5896 im.set_alpha(alpha) 5897 if im.get_clip_path() is None: 5898 # image does not already have clipping set, clip to Axes patch File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/image.py:729, in _ImageBase.set_data(self, A) 727 if isinstance(A, PIL.Image.Image): 728 A = pil_to_array(A) # Needed e.g. to apply png palette. --> 729 self._A = self._normalize_image_array(A) 730 self._imcache = None 731 self.stale = True File ~/miniforge3/envs/pds/lib/python3.10/site-packages/matplotlib/image.py:692, in _ImageBase._normalize_image_array(A) 690 A = cbook.safe_masked_invalid(A, copy=True) 691 if A.dtype != np.uint8 and not np.can_cast(A.dtype, float, "same_kind"): --> 692 raise TypeError(f"Image data of dtype {A.dtype} cannot be " 693 f"converted to float") 694 if A.ndim == 3 and A.shape[-1] == 1: 695 A = A.squeeze(-1) # If just (M, N, 1), assume scalar and apply colormap. TypeError: Image data of dtype object cannot be converted to float
Applying a sepia filter¶
We won't walk through this example in lecture, but it's here for your reference.
- Let's sepia-fy Junior!
(Image credits)
- From here, we can apply this conversion to each pixel.
sepia_filter = np.array([
[0.393, 0.769, 0.189],
[0.349, 0.686, 0.168],
[0.272, 0.534, 0.131]
])
# Multiplies each pixel by the sepia_filter matrix.
# Then, clips each RGB value to be between 0 and 1.
filtered = (img @ sepia_filter.T).clip(0, 1)
filtered
array([[[0.37, 0.33, 0.26],
[0.33, 0.3 , 0.23],
[0.33, 0.3 , 0.23],
...,
[0.47, 0.42, 0.32],
[0.47, 0.42, 0.33],
[0.49, 0.43, 0.34]],
[[0.36, 0.32, 0.25],
[0.33, 0.3 , 0.23],
[0.34, 0.3 , 0.24],
...,
[0.47, 0.42, 0.33],
[0.47, 0.42, 0.33],
[0.48, 0.43, 0.33]],
[[0.35, 0.31, 0.24],
[0.33, 0.3 , 0.23],
[0.34, 0.3 , 0.24],
...,
[0.47, 0.42, 0.33],
[0.47, 0.42, 0.33],
[0.48, 0.43, 0.33]],
...,
[[0.3 , 0.26, 0.21],
[0.27, 0.24, 0.19],
[0.25, 0.23, 0.18],
...,
[0.37, 0.33, 0.26],
[0.41, 0.36, 0.28],
[0.45, 0.4 , 0.31]],
[[0.29, 0.25, 0.2 ],
[0.27, 0.24, 0.19],
[0.27, 0.24, 0.19],
...,
[0.36, 0.32, 0.25],
[0.41, 0.36, 0.28],
[0.46, 0.41, 0.32]],
[[0.33, 0.29, 0.23],
[0.32, 0.29, 0.22],
[0.33, 0.3 , 0.23],
...,
[0.35, 0.31, 0.24],
[0.4 , 0.35, 0.27],
[0.47, 0.42, 0.33]]])
plt.imshow(filtered)
plt.axis('off');
Key takeaway: avoid for-loops whenever possible!¶
You can do a lot without for-loops, both in numpy and in pandas.
Randomness and simulation¶
np.random¶
The submodule np.random contains various functions that produce random results.
These use pseudo-random number generators to generate random-seeming sequences of results.
# Run this cell multiple times!
# Returns a random integer between 1 and 6, inclusive.
...
Ellipsis
# Returns a random real number between 0 and 1.
...
# Returns a randomly selected element from the provided list, 5 times.
...
# Returns the number of occurrences of each outcome
# in 12 trials of an experiment in which
# outcome 1 happens 60% of the time and
# outcome 2 happens 40% of the time.
...
Ellipsis
Simulations¶
- Often, we'll want to estimate the probability of an event, but it may not be possible – or we may not know how – to calculate the probability exactly.
e.g., the probability that I see between 40 and 50 heads when I flip a fair coin 100 times.
- Or, we may have a theoretical answer, and want to validate it using another approach.
In such cases, we can use the power of simulation. We can:
- Figure out how to simulate one run of the experiment.
e.g., figure out how to get Python to flip a fair coin 100 times and count the number of heads. - Repeat the experiment many, many times.
- Compute the fraction of experiments in which our event occurs, and use this fraction as an estimate of the probability of our event.
This is the basis of Monte Carlo Methods.
- Figure out how to simulate one run of the experiment.
- Theory tells us that the more repetitions we perform of our experiment, the closer our fraction will be to the true probability of the event!
Specifically, the Law of Large Numbers tells us this.
Example: Coin flipping¶
- Question: What is the probability that I see between 40 and 50 heads, inclusive, when I flip a fair coin 100 times?
- Step 1: Figure out how to simulate one run of the experiment.
e.g., figure out how to get Python to flip a fair coin 100 times and count the number of heads.
...
Ellipsis
...
Ellipsis
def num_heads():
...
num_heads()
- Step 2: Repeat the experiment many, many times.
outcomes = ...
for _ in range(10_000):
...
- Step 3: Compute the fraction of experiments in which our event occurs, and use this fraction as an estimate of the probability of our event.
px.histogram(outcomes)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) /var/folders/pd/w73mdrsj2836_7gp0brr2q7r0000gn/T/ipykernel_57732/3497464906.py in ?() ----> 1 px.histogram(outcomes) ~/miniforge3/envs/pds/lib/python3.10/site-packages/plotly/express/_chart_types.py in ?(data_frame, x, y, color, pattern_shape, facet_row, facet_col, facet_col_wrap, facet_row_spacing, facet_col_spacing, hover_name, hover_data, animation_frame, animation_group, category_orders, labels, color_discrete_sequence, color_discrete_map, pattern_shape_sequence, pattern_shape_map, marginal, opacity, orientation, barmode, barnorm, histnorm, log_x, log_y, range_x, range_y, histfunc, cumulative, nbins, text_auto, title, template, width, height) 476 rectangular mark to visualize the 1D distribution of an aggregate 477 function `histfunc` (e.g. the count or sum) of the value `y` (or `x` if 478 `orientation` is `'h'`). 479 """ --> 480 return make_figure( 481 args=locals(), 482 constructor=go.Histogram, 483 trace_patch=dict( ~/miniforge3/envs/pds/lib/python3.10/site-packages/plotly/express/_core.py in ?(args, constructor, trace_patch, layout_patch) 2086 trace_patch = trace_patch or {} 2087 layout_patch = layout_patch or {} 2088 apply_default_cascade(args) 2089 -> 2090 args = build_dataframe(args, constructor) 2091 if constructor in [go.Treemap, go.Sunburst, go.Icicle] and args["path"] is not None: 2092 args = process_dataframe_hierarchy(args) 2093 if constructor in [go.Pie]: ~/miniforge3/envs/pds/lib/python3.10/site-packages/plotly/express/_core.py in ?(args, constructor) 1344 elif hasattr(args["data_frame"], "to_pandas_df"): 1345 args["data_frame"] = args["data_frame"].to_pandas_df() 1346 columns = args["data_frame"].columns 1347 else: -> 1348 args["data_frame"] = pd.DataFrame(args["data_frame"]) 1349 columns = args["data_frame"].columns 1350 elif df_provided: 1351 columns = args["data_frame"].columns ~/miniforge3/envs/pds/lib/python3.10/site-packages/pandas/core/frame.py in ?(self, data, index, columns, dtype, copy) 840 ) 841 # For data is scalar 842 else: 843 if index is None or columns is None: --> 844 raise ValueError("DataFrame constructor not properly called!") 845 846 index = ensure_index(index) 847 columns = ensure_index(columns) ValueError: DataFrame constructor not properly called!
...
Ellipsis
- This is remarkably close to the true, theoretical answer!
from scipy.stats import binom
binom.cdf(50, 100, 0.5) - binom.cdf(39, 100, 0.5)
0.5221945185847372
Question 🤔 (Answer at practicaldsc.org/q)
What questions do you have about our coin flipping simulation?
Can you think of a way to perform the same simulation without any for-loops?
Example: The Birthday Paradox¶
- There are ~80 students in the room right now. What are the chances at least 2 students share the same birthday?
- In general, how many people must be in a room such that there's a 50% chance that at least 2 students share the same birthday?
- Let's define a function,
estimated_probability, which takes in a class size,n, and returns the probability that in a class ofnstudents, at least 2 students share the same birthday.
def simulate_classroom(n):
# This helper function should take in a class size, n,
# and return True if a simulated classroom of size n
# has at least 2 students with the same birthday
# and False otherwise.
# This is not the most efficient solution, but works for now.
...
def estimated_probability(n):
...
...
Ellipsis
- With 80 students, it's almost certain that 2 share the same birthday!
What's the minimum class size we'd need for a 50% chance that at least 2 share the same birthday?
probs = [estimated_probability(n) for n in range(1, 51)]
(
px
.bar(x=range(1, 51),
y=probs,
title='Probability that at least 2 students share the<br>same birthday in a class of n students')
.update_xaxes(title='$n$')
.update_yaxes(title='Probability')
)
- Lower than you might think!
Next time¶
On Thursday, we'll begin our deep dive into pandas DataFrames.